We introduce a general framework for visual forecasting, which directly imitates visual sequences without additional supervision. As a result, our model can be applied at several semantic levels and does not require any domain knowledge or handcrafted features. We achieve this by formulating visual forecasting as an inverse reinforcement learning (IRL) problem and directly imitating the dynamics in natural sequences from their raw pixel values. The key challenge is the high-dimensional and continuous state-action space, which prohibits the application of previous IRL algorithms. We address this computational bottleneck by extending recent progress in model-free imitation with trainable deep feature representations, which (1) bypasses the exhaustive state-action pair visits of dynamic programming through a dual formulation and (2) avoids explicit state sampling during gradient computation through a deep feature reparametrization. This allows us to apply IRL at scale and directly imitate the dynamics in high-dimensional, continuous visual sequences from raw pixel values. We evaluate our approach at three levels of abstraction, from low-level pixels to higher-level semantics: future frame generation, action anticipation, and visual story forecasting. At all three levels, our approach outperforms existing methods.
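For context, a representative dual formulation from model-free imitation learning (generative adversarial imitation learning, Ho & Ermon, 2016) poses imitation as a saddle-point problem between a policy $\pi$ and a discriminator $D$; the following is a sketch of that general objective, not a restatement of this paper's exact formulation:

$$
\min_{\pi} \max_{D} \; \mathbb{E}_{(s,a)\sim\pi}\big[\log D(s,a)\big] \;+\; \mathbb{E}_{(s,a)\sim\pi_E}\big[\log\big(1 - D(s,a)\big)\big] \;-\; \lambda H(\pi),
$$

where $\pi_E$ denotes the expert (demonstration) policy and $H(\pi)$ is a causal entropy regularizer. Because this objective is estimated from sampled trajectories rather than by dynamic programming over an enumerated state-action space, it remains tractable when states are raw pixels.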